Introduction

Many relationships, such as social connections, communication networks, and biological pathways, can be represented as networks. Network visualizations are valuable exploratory tools that can summarize large datasets and provide a concise representation of data structures before quantitative models are applied. For example, they can help us identify likely key actors, detect potential clusters or communities, and observe the overall structure of interactions, which facilitates hypothesis generation and paves the way for more in-depth quantitative modeling and analysis.

More about the data

This short tutorial shows how to use the tidygraph and ggraph libraries in R to easily create and customize network visualizations. It uses real-world, messy data from one of my collaborative research projects, where we study the Twitter follower-followee connections among 4000+ state legislators in the US, comprising ~160,000 ties. (I collected follower information using Twitter’s API and transformed it into network data, where a tie exists between legislator accounts i and j if i follows j. This is a directed network.)

The tutorial shows how network visualization can reveal patterns within this follower network: clusters based on geographic, demographic and partisan affiliations, interesting properties of central nodes, and the overall structure of the connections, among other observations. While no hard conclusions should be drawn from the visualization alone, it provides an accessible and concise summary of the data set that is easy to share with others.

Note: My laptop has 18GB of RAM, and it took about 1 minute to render each plot.

Load required libraries

library(dplyr)
library(ggplot2)
library(igraph)
library(ggraph)
library(tidygraph)

Load data

# Read edge list and node data from RDS files
follower_edges <- readRDS("data/followers_edgelist_R1.Rds")
nodes <- readRDS("data/cleaned_nodes_R1.Rds")

Let’s look at a few rows from the edge list. It has two columns, and each row represents one directed edge, given by its source node and target node. If i follows j, the edge i->j is recorded with i in the follower_id column and j in the legislator_id column.

# Show first 3 rows of the edge list dataframe
head(follower_edges, 3)
##     follower_id  legislator_id
## 1 str_963765775 str_2873254919
## 2  str_29012641 str_2873254919
## 3 str_123577910 str_2873254919

Let’s see how many edges are in this network:

# Display dimensions of the edge list dataframe
dim(follower_edges)
## [1] 159346      2

The network has 159,346 directed edges.

Let’s look at a few rows from the node data. It contains demographic (gender, race), political (party), and geographic (state, contiguity) information for each node (legislator) in the data set.

# Show first 3 rows of the node dataframe
head(nodes[,c(-2, -4)], 3)
##           str_id   state chamber party state.abb party3 index  race gender
## 1 str_2873254919 Alabama       H     R        AL      R     1 White   male
## 2 str_1089892711 Alabama       H     R        AL      R     2 White   male
## 3  str_474388304 Alabama       H     R        AL      R     3 White female
##        mds1 in_subnet
## 1 -1.430225         1
## 2 -1.430225         1
## 3 -1.430225         1

Calculate the in-degree centrality and add node label information

Here we add some additional node information that will be useful for setting some plot aesthetics later (shape, size, label):

First, we calculate the in-degree (number of incoming follower ties) for each node (legislator).

We create a directed graph object named g_follower using the graph_from_data_frame() function from the igraph package:

- d = follower_edges specifies the dataframe containing information about the edges of the graph.
- vertices = nodes specifies the dataframe containing information about the vertices (nodes) of the graph.

Then, we use the degree() function on the graph created above and add this information back to the nodes dataframe. (There are other ways to calculate the in-degree value that do not involve creating a graph.)

Second, using the in-degree measure, we identify the top 5 legislators with the highest number of followers within each state and use their state abbreviations as the labels in the node dataframe, setting the remaining labels to NA. (The number 5 is arbitrary; the goal is to avoid too many overlapping labels in the dense network while retaining useful information to identify state clusters, if any are present, and to see how those with the most followers in a state are positioned in the network. Note that top_n() keeps ties, so a state can end up with more than 5 labels.)

# Create directed graph object from edge list and node data
g_follower <- graph_from_data_frame(d = follower_edges, vertices = nodes, directed = TRUE)

# Calculate in-degree for each node (number of incoming follower ties)
V(g_follower)$indegree <- degree(g_follower, mode = 'in', loops = FALSE)
nodes$follower_indegree <- V(g_follower)$indegree

# Find the top 5 most central nodes within each state and assign labels
top_5_follower <- nodes %>% 
  group_by(state) %>% 
  top_n(5, wt=follower_indegree) %>% 
  mutate(follower_labels = state.abb)

# Join labels back into nodes dataframe
nodes <- nodes %>% 
  left_join(top_5_follower[c('str_id','follower_labels')], by='str_id')

# Display first 3 rows of the nodes dataframe
head(nodes[,c(10:15)], 3)
##    race gender      mds1 in_subnet follower_indegree follower_labels
## 1 White   male -1.430225         1                25            <NA>
## 2 White   male -1.430225         1                22            <NA>
## 3 White female -1.430225         1                40              AL
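As noted above, the in-degree can also be computed without building a graph, since it is simply the number of times an account appears in the target column of the edge list. The sketch below uses a small made-up edge list (toy_edges and toy_nodes are hypothetical stand-ins for follower_edges and nodes). It also uses slice_max(), which dplyr recommends over the superseded top_n(); like top_n(), it keeps ties by default.

```r
library(dplyr)

# Hypothetical toy data standing in for follower_edges and nodes
toy_edges <- data.frame(
  follower_id   = c("a", "b", "c", "a", "d"),
  legislator_id = c("b", "c", "b", "c", "b")
)
toy_nodes <- data.frame(
  str_id    = c("a", "b", "c", "d"),
  state.abb = c("AL", "AL", "AK", "AK")
)

# In-degree = number of rows in which an account is the target of a follow
toy_indegree <- toy_edges %>%
  filter(follower_id != legislator_id) %>%  # ignore self-loops, like loops = FALSE
  count(legislator_id, name = "follower_indegree")

# Accounts that nobody follows are absent from the count, so fill them with 0
toy_nodes <- toy_nodes %>%
  left_join(toy_indegree, by = c("str_id" = "legislator_id")) %>%
  mutate(follower_indegree = coalesce(follower_indegree, 0L))

# Most-followed account per state; with_ties = TRUE (the default) keeps ties
toy_top <- toy_nodes %>%
  group_by(state.abb) %>%
  slice_max(follower_indegree, n = 1) %>%
  ungroup()
```

With the real data, n = 5 would mirror the top-5 selection used in this tutorial.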

Some descriptive information

The in-degree distribution is right-skewed, meaning that a few legislators attract a disproportionately large number of followers compared to the rest. The dashed red line marks the mean.

# Plot indegree distribution 
ggplot(nodes, aes(x = follower_indegree)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black", alpha = 0.7) +
  geom_vline(aes(xintercept = mean(follower_indegree)), color = "red", linetype = "dashed", linewidth = 1) +
  labs(title = "Distribution of Follower Indegree",
       x = "Follower Indegree",
       y = "Frequency") +
  theme_minimal()
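A quick numeric check can back up the visual impression of skew: in a right-skewed distribution the mean sits above the median, and the upper quantiles stretch far away from the bulk of the data. The sketch below uses a simulated vector (toy_counts is made up, not the legislators' data); with the real data you would pass nodes$follower_indegree instead.

```r
# Simulated right-skewed counts; a hypothetical stand-in for nodes$follower_indegree
set.seed(1)
toy_counts <- rnbinom(1000, size = 1, mu = 40)

# A mean well above the median is a common sign of right skew
skew_summary <- c(
  mean   = mean(toy_counts),
  median = median(toy_counts),
  quantile(toy_counts, probs = c(0.9, 0.99))
)
skew_summary
```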

We have 3 values for party affiliation:

table(nodes$party3)
## 
##    D    I    R 
## 2244   11 1853

Two values for the chamber each legislator belongs to (H = House, S = Senate):

table(nodes$chamber)
## 
##    H    S 
## 2938 1170

And 50 states:

table(nodes$state.abb)
## 
##  AK  AL  AR  AZ  CA  CO  CT  DE  FL  GA  HI  IA  ID  IL  IN  KS  KY  LA  MA  MD 
##  24  53  77  78 115  83  83  21 132 135  25  74  31 116  74  71  87  78 157 127 
##  ME  MI  MN  MO  MS  MT  NC  ND  NE  NH  NJ  NM  NV  NY  OH  OK  OR  PA  RI  SC 
##  44  93 147 126  74  47 118  28  22 137  82  46  55 167 100  80  53 171  67 101 
##  SD  TN  TX  UT  VA  VT  WA  WI  WV  WY 
##  23  83 156  63 104  30  76  96  57  21
length(table(nodes$state.abb))
## [1] 50

Create the graph object and set the layout

Now, using the updated node information, let’s recreate the graph, this time using the tbl_graph() function from the tidygraph package. The tidygraph package provides a tidy data structure for graph and network data, so we can manipulate and analyze graphs using the same principles as tidyverse packages like dplyr and ggplot2. This is very nice and keeps the code neat.

# Recreate the graph and create a tidygraph object
g_follower_tidy <- tbl_graph(
    nodes = nodes,
    edges = follower_edges,
    node_key = "str_id") %>%
  activate(nodes) %>%  # Sets context to nodes -> subsequent operations are performed on nodes
  filter(!node_is_isolated())  # Removes nodes that are isolated/do not have any follower edges
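To see what the node_is_isolated() filter does, here is a minimal sketch on a made-up four-node graph (toy_nodes and toy_edges are hypothetical). Node "d" has no incoming or outgoing ties, so it is dropped from the node table.

```r
library(dplyr)
library(tidygraph)

# Hypothetical toy data: "d" does not appear in any edge
toy_nodes <- data.frame(str_id = c("a", "b", "c", "d"))
toy_edges <- data.frame(
  follower_id   = c("a", "b"),
  legislator_id = c("b", "c")
)

# node_key tells tidygraph which node column the edge endpoints refer to
toy_graph <- tbl_graph(nodes = toy_nodes, edges = toy_edges, node_key = "str_id")

toy_connected <- toy_graph %>%
  activate(nodes) %>%
  filter(!node_is_isolated())

# IDs of the nodes that remain after filtering
kept_ids <- toy_connected %>% as_tibble() %>% pull(str_id)
kept_ids
```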

Next we use the create_layout function from the ggraph package, which defines how the nodes and edges should be arranged in the plot. The ggraph package is an extension of ggplot2 specifically designed for creating network visualizations. We use the Fruchterman-Reingold algorithm (“fr”) to set the layout of this graph.

The Fruchterman-Reingold layout algorithm (Fruchterman T.M.J., Reingold E.M. 1991) is a force-directed layout algorithm commonly used for visualizing graphs. In short, it treats the graph as a physical system: nodes behave like electrically charged particles that repel each other, edges act like springs that pull connected nodes together, and the algorithm iteratively moves the nodes to lower the energy of this system. Note that calculating forces for all pairs of nodes is computationally expensive for large graphs, so the layout can take some time to compute.

# Set seed for layout reproducibility
set.seed(10)
# Create layout using the Fruchterman-Reingold algorithm from igraph
follower_layout <- create_layout(g_follower_tidy, layout = "igraph", algorithm = "fr")
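Because the Fruchterman-Reingold algorithm starts from random positions, the seed matters. The sketch below calls igraph's layout_with_fr() directly on a small made-up ring graph (toy_ring is hypothetical) to show that the layout is just a two-column matrix of x/y coordinates, and that resetting the seed reproduces it. The niter argument (number of iterations, 500 by default) is one knob for trading layout quality against run time.

```r
library(igraph)

# Hypothetical toy graph: 10 nodes connected in a ring
toy_ring <- make_ring(10)

set.seed(10)
coords_a <- layout_with_fr(toy_ring, niter = 500)

set.seed(10)
coords_b <- layout_with_fr(toy_ring, niter = 500)

dim(coords_a)                  # one row per node, one column per axis
all.equal(coords_a, coords_b)  # same seed, same starting positions
```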

Create the Visualization

Let’s start by visualizing the basic graph structure without incorporating any additional information on other variables.

# Plot the basic graph structure with default settings
ggraph(follower_layout) +
  geom_edge_link() +
  geom_node_point()

Okay... the edges completely cover the nodes. Let’s reduce the opacity of the edges using the alpha option to see if we can make the nodes visible.

# Plot the graph structure with reduced edge intensity (alpha)

ggraph(follower_layout) +
  geom_edge_link(alpha=.01) + # Reduce edge intensity using alpha
  geom_node_point()

That worked really well!

Next, let’s add information about the party and chamber affiliation of each node (legislator) using the color and shape options. Let’s color the nodes based on party (Democrat, Republican, Independent) and assign node shapes based on the legislator’s chamber (House or Senate). Note that default colors and shapes are chosen if they are not explicitly provided:

# This code adds color and shape aesthetics to represent the party and chamber information of each node.

ggraph(follower_layout) +
  geom_edge_link(alpha = 0.01) +
  geom_node_point(
    aes(
      color = party3, # Color nodes based on party affiliation (D, R or I)
      shape = chamber # Shape nodes based on chamber (House or Senate)
    )
  ) 

Finally, let’s adjust the size and transparency of the nodes based on their follower in-degree (higher values = bigger, more opaque nodes) and add state labels to the top 5 nodes in each state with the highest in-degree. Let’s also customize the color of the nodes and the legend labels.

# Plot the graph with additional aesthetics for color, shape, size, and labels

ggraph(follower_layout) +
  geom_edge_link(alpha = 0.01) +
  geom_node_point(aes(color = party3, # Color nodes based on party affiliation (D, R or I)
                      shape = chamber, # Shape nodes based on chamber (House or Senate)
                      alpha = follower_indegree, # Adjust node transparency based on follower indegree
                      size = follower_indegree # Adjust node size based on follower indegree
                      )) +
  scale_color_manual("Party",
                     values = c(D = "dodgerblue", # Assign color for Democrat
                                R = "firebrick2", # Assign color for Republican
                                I = "yellow")) + # Assign color for Independent
  geom_node_text(aes(label = follower_labels,
                     size = follower_indegree/3
                     )) +
  theme_graph(base_family = 'Helvetica') +
  guides(
    alpha = guide_legend(title="In-degree"),  # Customizing Legend labels 
    color = guide_legend(title = "Party"),
    shape = guide_legend(title = "Chamber"),
    size = guide_legend(title = "In-degree")
  )
## Warning: Removed 3782 rows containing missing values or values outside the scale range
## (`geom_text()`).

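That looks good! The warning is expected: only the top-ranked legislators in each state have a label, so geom_node_text() drops the rows whose label is NA. With a few easy steps, we are able to create an impactful visualization that highlights interesting properties of this data.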
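If the missing-values warning from geom_node_text() is distracting, one option is ggraph's filter aesthetic, which subsets the rows a layer draws. Below is a minimal sketch on a made-up three-node graph (toy_graph and its contents are hypothetical); it is an alternative to the approach used above, not a required step.

```r
library(ggraph)
library(tidygraph)

# Hypothetical three-node graph where only one node carries a label
toy_graph <- tbl_graph(
  nodes = data.frame(str_id = c("a", "b", "c"),
                     follower_labels = c(NA, "AL", NA)),
  edges = data.frame(from = c(1L, 2L), to = c(2L, 3L))
)

set.seed(10)
p <- ggraph(toy_graph, layout = "fr") +
  geom_edge_link(alpha = 0.3) +
  geom_node_point() +
  # Only rows with a non-missing label are passed on to the text layer
  geom_node_text(aes(label = follower_labels,
                     filter = !is.na(follower_labels)),
                 vjust = -1)
p
```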

Some Observations

We could have also set aesthetics based on different variables, like coloring the nodes by state:

# Plot the graph where nodes are colored by state 

ggraph(follower_layout) +
  geom_edge_link(alpha = 0.01) +
  geom_node_point(aes(color = state, # Color nodes based on state this time
                      shape = chamber)) # Shape nodes based on chamber (House or Senate)

Or by gender:

# Plot the graph where nodes are colored by gender

ggraph(follower_layout) +
  geom_edge_link(alpha = 0.01) +
  geom_node_point(aes(color = gender, # Color nodes based on gender this time 
                      shape = chamber # Shape nodes based on chamber (House or Senate)
                      ))